Web Crawler Notes 4

Some Python tricks that come up when writing crawlers.

string.strip([c])

//Returns a copy of the string with the characters in c stripped from the beginning and the end only; characters in the middle of the string are left untouched.

Reference: https://www.tutorialspoint.com/python/string_strip.htm

.strip() removes all whitespace at the start and end, including spaces, tabs, newlines and carriage returns.

Reference: https://stackoverflow.com/questions/13013734/string-strip-in-python
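A quick sketch of both forms of strip(), written for Python 3. The sample strings are made up for illustration. Note that the argument is treated as a set of characters, not as a prefix or suffix.

```python
# Python 3 sketch; the sample strings below are invented for illustration.
raw = "  \t<title>Hello</title>\r\n"
cleaned = raw.strip()              # no argument: strip whitespace at both ends
print(repr(cleaned))               # '<title>Hello</title>'

price = "$$19.99$".strip("$")      # every leading/trailing "$" is removed
print(price)                       # 19.99

# The argument is a set of characters: any mix of "x" and "y" is stripped.
mixed = "xyxhello worldyx".strip("xy")
print(mixed)                       # hello world

# Characters in the middle are never touched.
inner = "--a-b--".strip("-")
print(inner)                       # a-b
```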

str.split(str="", num=number)

//str is the separator (by default, any run of whitespace); num is the maximum number of splits to perform.

For example:

text = "Line1-abcdef \nLine2-abc \nLine4-abcd"
print(text.split())        # split on any run of whitespace
print(text.split(' ', 1))  # split on the first space only

Output:

['Line1-abcdef', 'Line2-abc', 'Line4-abcd']
['Line1-abcdef', '\nLine2-abc \nLine4-abcd']

Reference: https://www.tutorialspoint.com/python/string_split.htm
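Two details of split() that tend to matter in a crawler, sketched in Python 3. The header line and the sample strings are hypothetical. The first shows why the split limit is useful when the value itself contains the separator; the second shows that split() and split(' ') are not the same thing.

```python
# Python 3 sketch; the header line is a made-up example.
header = "Location: http://example.com/a"

# Without a limit, the ":" inside the URL is split as well.
naive = header.split(":")
print(naive)          # ['Location', ' http', '//example.com/a']

# Limiting to one split keeps the value intact; strip() tidies the space.
name, value = header.split(":", 1)
value = value.strip()
print(name, value)    # Location http://example.com/a

# split() collapses runs of whitespace; split(' ') keeps empty fields.
collapsed = "a  b".split()
literal = "a  b".split(" ")
print(collapsed)      # ['a', 'b']
print(literal)        # ['a', '', 'b']
```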

Unicode

A character is not, not, not a byte. A character is the platonic ideal of the smallest unit of text. A character encoding defines a mapping between our platonic characters and some way of representing them as bytes. Because there are many ways of representing the same character as bytes, a series of bytes whose encoding you do not know is meaningless, even if you know the data is textual. So the first thing to do, before anything else, is to find out the encoding.
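To make the point concrete, here is a small Python 3 sketch (in Python 3, str is the text type and bytes is the byte type). The sample text is arbitrary. The same two characters become different byte sequences under different encodings, and decoding with the wrong encoding gives nonsense rather than an error.

```python
# Python 3 sketch; the sample text is arbitrary.
text = "爬虫"                       # two characters

utf8_bytes = text.encode("utf-8")   # 3 bytes per character here
gbk_bytes = text.encode("gbk")      # 2 bytes per character here
print(len(text), len(utf8_bytes), len(gbk_bytes))  # 2 6 4

# Same characters, different bytes.
print(utf8_bytes != gbk_bytes)      # True

# Decoding with the right encoding recovers the text.
roundtrip = utf8_bytes.decode("utf-8")

# Decoding with the wrong one "works" but yields mojibake:
# latin-1 maps every byte to some character, so no error is raised.
mojibake = utf8_bytes.decode("latin-1")
print(len(mojibake))                # 6
```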

In Python 2, there are three distinct string types:

  1. ‘unicode’, which represents unicode strings (text strings).
  2. ‘str’, which represents byte strings (binary data).
  3. ‘basestring’, which acts as a parent class for both of the other string types.

Conversion between the two types is meaningless without an encoding. When none is given, Python 2 falls back on a ‘default encoding’, which is ASCII unless it has been changed with sys.setdefaultencoding().

Simply setting the encoding with sys.setdefaultencoding() is one solution, but not a very good one, since pages on the web use many different text encodings.
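Since each page can arrive in its own encoding, one option is to choose the encoding per response instead of globally. Below is a minimal Python 3 sketch that reads the charset from a Content-Type header value. The helper name charset_from_content_type, the UTF-8 fallback, and the sample header are all assumptions made for illustration; real pages may also declare their encoding in a meta tag, which this does not handle.

```python
# Python 3 sketch. charset_from_content_type is a hypothetical helper,
# and falling back to UTF-8 is an assumption, not a rule.
def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`."""
    # Skip the media type itself and look at the ";"-separated parameters.
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.strip().lower() == "charset" and value.strip():
            return value.strip().strip("\"'").lower()
    return default


# Pretend these bytes and this header came from one HTTP response.
body = "中文".encode("gbk")
encoding = charset_from_content_type("text/html; charset=GBK")
page_text = body.decode(encoding)
print(encoding, page_text)  # gbk 中文
```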

Here is a more correct approach, taken from: http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python

  • All text strings, everywhere should be of type unicode, not str. If you’re handling text, and your variable is a str, it’s a bug!
  • To decode a byte string as text, use var.decode(encoding) (e.g., var.decode('utf-8')), with the correct encoding. To encode a text string as bytes, use var.encode(encoding).
  • Never ever use str() on a unicode string, or unicode() on a byte string without a second argument specifying the encoding.
  • Whenever you read data from outside your app, expect it to be bytes - eg, of type str - and call .decode() on it to interpret it as text. Likewise, always call .encode() on text you want to send to the outside world.
  • If a string literal in your code is intended to represent text, it should always be prefixed with ‘u’. In fact, you probably never want to define a raw string literal in your code at all. For what it’s worth, though, I’m terrible at this one, as I’m sure pretty much everyone else is, too.
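The rules above boil down to: decode at the input edge, work with text in the middle, encode at the output edge. Here is a minimal sketch of that pattern in Python 3, where the text type is str and the byte type is bytes (the Python 2 equivalents are unicode and str). The function extract_title and the sample page are hypothetical, and the naive title search stands in for a real HTML parser.

```python
# Python 3 sketch; extract_title and the sample page are made up.
def extract_title(raw_html: bytes, encoding: str) -> str:
    """Decode bytes from the outside world, then work with text only."""
    html = raw_html.decode(encoding)          # bytes -> text at the edge
    start = html.index("<title>") + len("<title>")
    end = html.index("</title>", start)
    return html[start:end].strip()


# Input arrives as bytes in whatever encoding the site uses (GBK here).
raw = "<html><title> 爬虫笔记 </title></html>".encode("gbk")
title = extract_title(raw, "gbk")
print(title)                                  # 爬虫笔记

# Encode again only when the text leaves the program.
out_bytes = title.encode("utf-8")
```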

In Python 2 crawler scripts, I often see this at the top of the file:

import sys
reload(sys)  # Python 2 only: restores sys.setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')

In Python 3 the default encoding is already UTF-8, and sys.setdefaultencoding() no longer exists, so there is no point in writing this.
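This is easy to check in a Python 3 interpreter. The sketch below only inspects the sys module; the variable names are my own. Bytes read from the network still have to be decoded explicitly, whatever the default is.

```python
# Python 3 sketch: inspect the default encoding instead of setting it.
import sys

default_encoding = sys.getdefaultencoding()
print(default_encoding)   # utf-8

# The Python 2 setter is gone, so there is nothing to reload or call.
has_setter = hasattr(sys, "setdefaultencoding")
print(has_setter)         # False
```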